Red Wine Data Analysis by Josias Marcos Orlando

Requirements:

Libraries:

  • ggplot2
  • GGally
  • scales
  • memisc
  • gridExtra

Dataset

The Red Wine Quality data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

Input variables (based on physicochemical tests):

  • fixed acidity (tartaric acid - g / dm^3)
  • volatile acidity (acetic acid - g / dm^3)
  • citric acid (g / dm^3)
  • residual sugar (g / dm^3)
  • chlorides (sodium chloride - g / dm^3)
  • free sulfur dioxide (mg / dm^3)
  • total sulfur dioxide (mg / dm^3)
  • density (g / cm^3)
  • pH
  • sulphates (potassium sulphate - g / dm^3)
  • alcohol (% by volume)

Output variable (based on sensory data):

  • quality (score between 0 and 10)

For this analysis, we are mainly looking to answer one question: Which chemical properties influence the quality of red wines?

Univariate Plots Section

To understand the Red Wine Quality dataset a little better, the first step is to look at a summary of its variables. The minimum and maximum values show how widely each variable is spread, and comparing the mean with the median gives a quick sense of the shape of each distribution.

##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000
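
The listing above has the shape of the output of R's `names()` and `summary()`. A minimal sketch of how it might be produced follows. The file name `wineQualityReds.csv`, the helper `load_wines()` and the leading index column `X` are assumptions about how the data was stored, not something stated in this report.

```r
# Read the wine data and drop a row-index column if one is present.
# The column name "X" is an assumption about the CSV layout.
load_wines <- function(path) {
  wines <- read.csv(path)
  wines[, setdiff(names(wines), "X"), drop = FALSE]
}

# Assumed file name; adjust to wherever the data set is stored.
wine_path <- "wineQualityReds.csv"
if (file.exists(wine_path)) {
  wines <- load_wines(wine_path)
  print(names(wines))    # variable names
  print(summary(wines))  # min, quartiles, median, mean and max per variable
}
```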

Another useful way to check the data is to visualize the distribution of each feature. This can be done with histograms, as shown below:
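
One possible way to draw a histogram per feature with ggplot2 and gridExtra (both listed in the requirements) is sketched here. The function names `feature_histogram()` and `histogram_grid()`, the bin count and the grid width are illustrative choices, not the settings used for the original figures.

```r
library(ggplot2)
library(gridExtra)

# Build one histogram for a single numeric column of a data frame.
feature_histogram <- function(data, feature, bins = 30) {
  ggplot(data, aes(x = .data[[feature]])) +
    geom_histogram(bins = bins) +
    labs(x = feature, y = "count")
}

# Collect one histogram per column into a grid layout (not drawn yet).
histogram_grid <- function(data, ncol = 4) {
  plots <- lapply(names(data), function(f) feature_histogram(data, f))
  arrangeGrob(grobs = plots, ncol = ncol)
}

# Usage, assuming `wines` is the data frame summarised above:
# grid::grid.draw(histogram_grid(wines))
```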

Univariate Analysis

As mentioned above, analysing the distributions gives a quick overview of our data and its boundaries.

We can classify the distributions as a:

Based on the definitions above and on the summary (mean and median), we can classify each wine property into one of the distributions mentioned above:
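
The mean-versus-median comparison can be sketched as a small helper. The function name `skew_direction()`, the three labels and the 5% tolerance are assumptions made for illustration; the report does not state a threshold. The helper also assumes a non-zero median.

```r
# Label a numeric vector by comparing its mean with its median.
# `tol` is the relative gap below which the two are treated as equal;
# the default of 5% is an arbitrary choice for illustration.
skew_direction <- function(x, tol = 0.05) {
  gap <- (mean(x) - median(x)) / abs(median(x))
  if (gap > tol) {
    "right skewed"
  } else if (gap < -tol) {
    "left skewed"
  } else {
    "roughly symmetric"
  }
}

# With the summary values reported above (mean vs median):
#   total.sulfur.dioxide: 46.47 vs 38.00   -> relative gap of about 0.22
#   density:              0.9967 vs 0.9968 -> relative gap close to 0
# Usage, assuming `wines` holds the data set:
# sapply(wines, skew_direction)
```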

Applying Log Scale to the distribution

One interesting transformation that helps to “normalize” a distribution is applying a log scale to the data:

If we look more closely at a couple of variables, such as Total SO2 and density, we can see how the log scale helps to normalize them:

For Total SO2, which has a right-skewed distribution, applying the log function produces the shape of a normal distribution. On the other hand, density, which already has a normal distribution, doesn’t lose its shape, and we get a smoother distribution with fewer bumps.
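
A log-scaled histogram of this kind could be drawn roughly as follows. This is a sketch using ggplot2's `scale_x_log10()`; the helper name `log_histogram()` and the bin count are assumptions, and the original figures may have been produced differently.

```r
library(ggplot2)

# Histogram of one column on a log10 x axis.
# Values must be strictly positive for the log scale to be defined.
log_histogram <- function(data, feature, bins = 30) {
  ggplot(data, aes(x = .data[[feature]])) +
    geom_histogram(bins = bins) +
    scale_x_log10() +
    labs(x = paste(feature, "(log10 scale)"), y = "count")
}

# Usage, assuming `wines` holds the data set:
# log_histogram(wines, "total.sulfur.dioxide")
# log_histogram(wines, "density")
```

Note that citric.acid has a minimum of 0.000 in the summary, so zero values would be dropped with a warning if it were plotted this way.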

Bivariate Plots Section

Matrix Plot

For a quick look at the correlation between pairs of features, it’s possible to plot a matrix of plots and correlation values. This plot is shown below:

Since the Matrix Plot is a little hard to read and the correlation numbers are cut off, I decided to generate a scatter plot of Wine Quality against each feature, with the mean and median also drawn in the plot. I also calculated the correlation between Wine Quality and each of the features.

Another comparison I decided to make was a box plot of each feature for each Wine Quality score. To do this, I added a new categorical variable named grade_number to our dataset. This helped me to see the variation of the data at each Wine Quality score.

The plots and calculations are shown below:
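
The correlation of quality with every feature can also be computed in one pass. The helper below is a sketch with an assumed name, `correlations_with()`, and it uses `cor()` rather than the `cor.test()` calls whose output appears in the following sections.

```r
# Pearson correlation between a target column and every other column.
correlations_with <- function(data, target = "quality") {
  features <- setdiff(names(data), target)
  sapply(features, function(f) cor(data[[target]], data[[f]]))
}

# Usage, assuming `wines` holds the data set:
# round(sort(correlations_with(wines), decreasing = TRUE), 4)
```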
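
A sketch of the categorical variable and the two plot types follows. That grade_number is simply quality converted to a factor is an assumption, and the function names, colours and jitter settings are illustrative only.

```r
library(ggplot2)

# Add a categorical copy of the quality score.
# Assumption: grade_number is quality converted to a factor.
add_grade_number <- function(data) {
  data$grade_number <- factor(data$quality)
  data
}

# Scatter plot of a feature against quality, with the mean (red) and
# median (blue) of the feature at each quality score drawn as lines.
quality_scatter <- function(data, feature) {
  ggplot(data, aes(x = quality, y = .data[[feature]])) +
    geom_jitter(alpha = 0.3, width = 0.25, height = 0) +
    stat_summary(fun = mean, geom = "line", colour = "red") +
    stat_summary(fun = median, geom = "line", colour = "blue") +
    labs(x = "quality", y = feature)
}

# Box plot of a feature for each quality grade.
quality_boxplot <- function(data, feature) {
  ggplot(add_grade_number(data), aes(x = grade_number, y = .data[[feature]])) +
    geom_boxplot() +
    labs(x = "quality grade", y = feature)
}

# Usage, assuming `wines` holds the data set:
# quality_scatter(wines, "alcohol")
# quality_boxplot(wines, "alcohol")
```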

Variable Analysis: Fixed Acidity

In this section, we analyse Fixed Acidity against the wine’s quality. First we calculate the correlation between the two variables, then draw a scatter plot with smoothed means and a box plot to examine the variation in the data.

## 
##  Pearson's product-moment correlation
## 
## data:  wines$quality and wines$fixed.acidity
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07548957 0.17202667
## sample estimates:
##       cor 
## 0.1240516

The correlation is small (about 0.12). Consistent with that value, the scatter plot and the box plot show widely spread data.

Variable Analysis: Volatile Acidity

In this section, we analyse Volatile Acidity against the wine’s quality. First we calculate the correlation between the two variables, then draw a scatter plot with smoothed means and a box plot to examine the variation in the data.

## 
##  Pearson's product-moment correlation
## 
## data:  wines$quality and wines$volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578

The correlation is negative and weak to medium (about -0.39). The scatter plot and the box plot show less spread in the data, as expected from that value.

Variable Analysis: Citric Acid

In this section, we analyse Citric Acid against the wine’s quality. First we calculate the correlation between the two variables, then draw a scatter plot with smoothed means and a box plot to examine the variation in the data.

## 
##  Pearson's product-moment correlation
## 
## data:  wines$quality and wines$citric.acid
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725

The correlation is small to medium (about 0.23). The box plot shows that the data is considerably spread.

Variable Analysis: Residual Sugar

In this section, we analyse Residual Sugar against the wine’s quality. First we calculate the correlation between the two variables, then draw a scatter plot with smoothed means and a box plot to examine the variation in the data.

## 
##  Pearson's product-moment correlation
## 
## data:  wines$quality and wines$residual.sugar
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03531327  0.06271056
## sample estimates:
##        cor 
## 0.01373164

The correlation is very small (about 0.01) and, with a p-value of 0.58, not statistically distinguishable from zero. The scatter plot and box plot show little variation between the two variables, although there are several outliers.

Variable Analysis: Chlorides

In this section, we analyse Chlorides against the wine’s quality. First we calculate the correlation between the two variables, then draw a scatter plot with smoothed means and a box plot to examine the variation in the data.

## 
##  Pearson's product-moment correlation
## 
## data:  wines$quality and wines$chlorides
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.17681041 -0.08039344
## sample estimates:
##        cor 
## -0.1289066

The correlation is small (about -0.13). The scatter plot and box plot show little variation between the two variables, although there are several outliers.

Variable Analysis: Free Sulfur Dioxide

In this section, we analyse Free Sulfur Dioxide against the wine’s quality. First we calculate the correlation between the two variables, then draw a scatter plot with smoothed means and a box plot to examine the variation in the data.

## 
##  Pearson's product-moment correlation
## 
## data:  wines$quality and wines$free.sulfur.dioxide
## t = -2.0269, df = 1597, p-value = 0.04283
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.099430290 -0.001638987
## sample estimates:
##         cor 
## -0.05065606

The correlation is very small (about -0.05). The scatter plot and box plot show a good amount of variation in the data.

Variable Analysis: Total Sulfur Dioxide

In this section, we analyse Total Sulfur Dioxide against the wine’s quality. First we calculate the correlation between the two variables, then draw a scatter plot with smoothed means and a box plot to examine the variation in the data.

## 
##  Pearson's product-moment correlation
## 
## data:  wines$quality and wines$total.sulfur.dioxide
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2320162 -0.1373252
## sample estimates:
##        cor 
## -0.1851003

We got a small correlation between the variables.

Variable Analysis: Density

In this section, we analyse Density against the wine’s quality. First we calculate the correlation between the two variables, then draw a scatter plot with smoothed means and a box plot to examine the variation in the data.

## 
##  Pearson's product-moment correlation
## 
## data:  wines$quality and wines$density
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2220365 -0.1269870
## sample estimates:
##        cor 
## -0.1749192

The correlation is small (about -0.17). The scatter plot and box plot show a few outliers on both the low and the high side.

Variable Analysis: pH

In this section, we analyse pH against the wine’s quality. First we calculate the correlation between the two variables, then draw a scatter plot with smoothed means and a box plot to examine the variation in the data.

## 
##  Pearson's product-moment correlation
## 
## data:  wines$quality and wines$pH
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.106451268 -0.008734972
## sample estimates:
##         cor 
## -0.05773139

We got a very small correlation between the variables.

Variable Analysis: Sulphates

In this section, we analyse Sulphates against the wine’s quality. First we calculate the correlation between the two variables, then draw a scatter plot with smoothed means and a box plot to examine the variation in the data.

## 
##  Pearson's product-moment correlation
## 
## data:  wines$quality and wines$sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971

The correlation is small and positive (about 0.25), which the scatter plot and box plot also show.

Variable Analysis: Alcohol

In this section, we analyse Alcohol against the wine’s quality. First we calculate the correlation between the two variables, then draw a scatter plot with smoothed means and a box plot to examine the variation in the data.

## 
##  Pearson's product-moment correlation
## 
## data:  wines$quality and wines$alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663

The correlation is medium and positive (about 0.48), which the scatter plot and box plot also show. This is the strongest correlation between a feature and the wine’s quality in the dataset.

Bivariate Analysis

The results obtained from the correlation analysis, between each feature and the Quality, were:

Feature               Correlation  Orientation  Strength
fixed.acidity              0.1241  Positive     Very Weak
volatile.acidity          -0.3906  Negative     Weak
citric.acid                0.2264  Positive     Weak
residual.sugar             0.0137  Positive     Very Weak
chlorides                 -0.1289  Negative     Very Weak
free.sulfur.dioxide       -0.0507  Negative     Very Weak
total.sulfur.dioxide      -0.1851  Negative     Very Weak
density                   -0.1749  Negative     Very Weak
pH                        -0.0577  Negative     Very Weak
sulphates                  0.2514  Positive     Weak
alcohol                    0.4762  Positive     Medium

Table interpretation (the strength bands apply to the absolute value of the correlation):

  • Strength: Very Weak (0 to 0.2), Weak (0.21 to 0.4), Medium (0.41 to 0.6), Strong (0.61 to 0.8), Very Strong (0.81 to 1.0)
  • Orientation: Positive or Negative
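
The strength bands can be applied mechanically with `cut()`. The sketch below uses assumed helper names, `correlation_strength()` and `correlation_orientation()`, and treats the upper edge of each band as inclusive, which matches the ranges listed in the table interpretation.

```r
# Map a correlation coefficient to a strength label based on its
# absolute value. Each band includes its upper edge.
correlation_strength <- function(r) {
  cut(abs(r),
      breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1),
      labels = c("Very Weak", "Weak", "Medium", "Strong", "Very Strong"),
      include.lowest = TRUE)
}

# Sign of the coefficient as a label.
correlation_orientation <- function(r) {
  ifelse(r >= 0, "Positive", "Negative")
}

# A few of the coefficients reported in the cor.test output above.
r <- c(alcohol = 0.4761663,
       volatile.acidity = -0.3905578,
       residual.sugar = 0.01373164)
print(data.frame(correlation = r,
                 orientation = correlation_orientation(r),
                 strength = correlation_strength(r)))
```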

Conclusion:

From the results we can highlight the correlations between Quality and Alcohol, Quality and Volatile Acidity, Quality and Sulphates, and Quality and Citric Acid. The other correlations are really small, which indicates that those features don’t have a big impact on the Wine Quality result.

Multivariate Plots Section

In this section, we generate more complex plots by adding color to the points. This adds a new layer and opens the path to analysing three variables at once, instead of only two as in the previous plots. For this analysis, we’ll use the variables with the largest correlations with Wine Quality, which from the table above are alcohol, volatile.acidity, sulphates and citric.acid.
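
A three-variable plot of this kind could look roughly like the sketch below, with two features on the axes and quality mapped to colour. The helper name `quality_coloured_scatter()`, the palette and the transparency are assumptions, not the settings behind the original figures.

```r
library(ggplot2)

# Scatter plot of two features with points coloured by quality score.
quality_coloured_scatter <- function(data, x_feature, y_feature) {
  ggplot(data, aes(x = .data[[x_feature]], y = .data[[y_feature]],
                   colour = factor(quality))) +
    geom_point(alpha = 0.6) +
    scale_colour_brewer(palette = "RdYlGn", name = "quality") +
    labs(x = x_feature, y = y_feature,
         title = paste(x_feature, "x", y_feature, "by quality"))
}

# Usage, assuming `wines` holds the data set:
# quality_coloured_scatter(wines, "alcohol", "volatile.acidity")
```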

alcohol X volatile.acidity X quality:

Multivariate analysis of the variables of interest that were selected after the calculation of the correlation. In this plot we see alcohol X volatile.acidity X quality:

## 
##  Pearson's product-moment correlation
## 
## data:  wines$volatile.acidity and wines$alcohol
## t = -8.2546, df = 1597, p-value = 3.155e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2488416 -0.1548020
## sample estimates:
##       cor 
## -0.202288

alcohol X sulphates X quality:

Multivariate analysis of the variables of interest that were selected after the calculation of the correlation. In this plot we see alcohol X sulphates X quality:

## 
##  Pearson's product-moment correlation
## 
## data:  wines$sulphates and wines$alcohol
## t = 3.7568, df = 1597, p-value = 0.0001783
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.04477906 0.14196454
## sample estimates:
##        cor 
## 0.09359475

alcohol X citric.acid X quality:

Multivariate analysis of the variables of interest that were selected after the calculation of the correlation. In this plot we see alcohol X citric.acid X quality:

## 
##  Pearson's product-moment correlation
## 
## data:  wines$citric.acid and wines$alcohol
## t = 4.4188, df = 1597, p-value = 1.059e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.06121189 0.15807276
## sample estimates:
##       cor 
## 0.1099032

volatile.acidity X sulphates X quality:

Multivariate analysis of the variables of interest that were selected after the calculation of the correlation. In this plot we see volatile.acidity X sulphates X quality:

## 
##  Pearson's product-moment correlation
## 
## data:  wines$sulphates and wines$volatile.acidity
## t = -10.804, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3060917 -0.2147125
## sample estimates:
##        cor 
## -0.2609867

volatile.acidity X citric.acid X quality:

Multivariate analysis of the variables of interest that were selected after the calculation of the correlation. In this plot we see volatile.acidity X citric.acid X quality:

## 
##  Pearson's product-moment correlation
## 
## data:  wines$citric.acid and wines$volatile.acidity
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5856550 -0.5174902
## sample estimates:
##        cor 
## -0.5524957

sulphates X citric.acid X quality:

Multivariate analysis of the variables of interest that were selected after the calculation of the correlation. In this plot we see sulphates X citric.acid X quality:

## 
##  Pearson's product-moment correlation
## 
## data:  wines$citric.acid and wines$sulphates
## t = 13.159, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2678558 0.3563278
## sample estimates:
##     cor 
## 0.31277

The correlations between the analysed variables are:

Correlation       alcohol  volatile.acidity  sulphates  citric.acid
alcohol                 -           -0.2023     0.0936       0.1099
volatile.acidity  -0.2023                 -    -0.2610      -0.5525
sulphates          0.0936           -0.2610          -       0.3128
citric.acid        0.1099           -0.5525     0.3128            -
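
The same pairwise values can be obtained in a single call to `cor()` instead of six separate tests. This is a sketch with an assumed helper name, `correlation_table()`. Unlike the table above, the diagonal of the result holds 1 rather than a dash.

```r
# Pairwise Pearson correlations among a chosen set of columns,
# rounded to four decimal places.
correlation_table <- function(data, columns) {
  round(cor(data[, columns]), 4)
}

selected <- c("alcohol", "volatile.acidity", "sulphates", "citric.acid")
# Usage, assuming `wines` holds the data set:
# correlation_table(wines, selected)
```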

Multivariate Analysis


Final Plots and Summary

Plot One

This plot analyzes the variable Volatile Acidity against the wine’s quality. It was chosen because it shows the correlation between the two variables and the scattered means of this relation.

Description One

This plot shows the negative correlation between volatile.acidity and wine quality. This correlation was also obtained in a previous section of this work.

Plot Two

In this plot we analyse the variable Alcohol against the wine’s quality. This plot is very important because it shows the relation between the quality variable and the most important feature in the dataset.

Description Two

This plot shows the correlation between the feature alcohol and the wine quality. It was important for verifying the relation between the quality and the variable with the largest correlation value, and it confirms that this is the strongest relation.

Plot Three

Multivariate analysis of the variables of interest that were selected after the calculation of the correlation. In this plot we see volatile.acidity X citric.acid X quality:

Description Three

This was the most unexpected result that I got in the analysis. It turns out that the lower the volatile.acidity, the higher the quality and the citric.acid value.


Reflection

The hardest part of this analysis was that there’s no clear relation between one or two features and the wine quality, which makes it hard to answer the question asked at the beginning of this project:
Which chemical properties influence the quality of red wines?

Even though this was a hard question to answer, it was possible to see that the feature most related to wine quality was alcohol. It has the largest correlation value, and the plots showed that, in general, the higher the alcohol content, the better the wine is considered.

Also, it’s important to say that the variables volatile.acidity, sulphates and citric.acid have some degree of influence on the wine quality.

These four features (alcohol, volatile.acidity, sulphates and citric.acid) were never in my thoughts as the most relevant to wine quality. I thought pH and residual sugar would be the ones that actually affected the perception of a wine’s quality. To me this showed how important Exploratory Data Analysis is for a clear understanding of the information, and that guesses can be completely different from what the data actually contains.

Personally, this project was a great challenge for me. I had never used R before this class, but I got the hang of it. The challenge was learning to use the language to extract meaningful information from data and to present it to people who don’t have a computing background. It was hard to decide which plot would be better in each case, but I’m satisfied with my work and more confident that I can actually work with it.

For future work, I’m thinking of starting to work with financial data about the stock market. It’s a topic that draws my attention and one where I think Data Science can be a really powerful tool.